# HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Andreas Kurth, Student Member, IEEE, Björn Forsberg, and Luca Benini, Fellow, IEEE

Abstract—Heterogeneous computers integrate general-purpose host processors with domain-specific accelerators to combine versatility with efficiency and high performance. To realize the full potential of heterogeneous computers, however, many hardware and software design challenges have to be overcome. While architectural and system simulators can be used to analyze heterogeneous computers, they are faced with unavoidable compromises between simulation speed and performance modeling accuracy.

In this work we present HEROv2, an FPGA-based research platform that enables accurate and fast exploration of heterogeneous computers consisting of accelerators based on clusters of 32-bit RISC-V cores and an application-class 64-bit ARMv8 or RV64 host processor.

HEROv2 allows to seamlessly share data between 64-bit hosts and 32-bit accelerators and comes with a fully open-source on-chip network, a unified heterogeneous programming interface, and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM. We evaluate HEROv2 in four case studies from the application level over toolchain and system architecture down to accelerator microarchitecture. We demonstrate how HEROv2 enables effective research and development on the full stack of heterogeneous computing. For instance, the compiler can tile loops and infer data transfers to and from the accelerators, which leads to a speedup of up to 4.4× compared to the original program and in most cases is only 15 % slower than a handwritten implementation, which requires 2.6× more code.

## 1 Introduction

ETEROGENEOUS integrated computing systems aim **⊥** to combine general-purpose computing with domainspecific, efficient processing capabilities [1]–[3]. Such computers integrate a general-purpose host processor with specialized programmable many-core accelerators (e.g., [4]–[6]). These systems are very complex and many challenges remain to be overcome to realize their full potential [7]. Central questions over the full stack of computing, from application programming over compilers and runtime libraries down to the accelerator microarchitecture, include: How to partition tasks between host and different accelerators? How to express that partitioning in programming languages, optimize it in the toolchain, or both? How to manage data sharing across host and accelerator, share address spaces and overcome the differences between cache-coherent memory subsystems, typically found on the host, and their non-coherent counterparts, which are typically found in accelerators? Which types and combinations of accelerators are optimal for a given domain?

Research on heterogeneous systems traditionally follows a double-track approach, where accelerators are developed in isolation [8], [9], and their impact on system-level performance is estimated through analytical models and simulators [10], [11]. Compared to using a prototype heterogeneous system, this approach has significant drawbacks: First, interactions between host, accelerators, the memory hierarchy, and peripherals are complex to model accurately, making accurate simulation orders of magnitude slower than running prototypes. Second, even full-system simulators model heterogeneous computers to a limited degree only [12]. For example, models of system-level interconnects or system memory management units (SMMUs), are missing or highly abstract and imprecise. Third, simulators that are not precisely calibrated and accuracy-validated against the simulated system are generally too inaccurate to provide reliable results, and fullsystem simulators are particularly unreliable [13]. A research platform that serves as a working prototype, on the other

hand, enables collaborative and accurate architectural analysis and optimization [14]. To perform system-level research using standard benchmarks and real-world applications, the platform must additionally provide a software stack that includes an application programming interface and a complete compiler toolchain.

Existing research platforms do not meet these requirements in their entirety. Many provide a custom accelerator on programmable logic [15], [16], and some even couple the accelerator to a host processor that runs an operating system [17], [18]. HEROv1 [19] provides software stack and compiler that enable the evaluation of real-world applications on a mixed-instruction set architecture (ISA) computer, but it fundamentally restricts host and accelerator to use the same data model (e.g., 32-bit). Additionally, HEROv1's on-chip network and memory subsystem are restricted to simple architectures that cannot meet the demands of modern heterogeneous computers.

In this work, we make three main contributions:

1) We resolve the mentioned limitations and present HEROv2, an open-source research platform where an application-class 64-bit host can seamlessly share data with a 32-bit parallel programmable accelerator. The latter is implemented on programmable logic and based on permissively licensed open-source RTL¹ components. The hardware components can be freely extended and modified and include a high-performance end-to-end on-chip network that can be fully customized to meet the memory demands of the accelerator and target application (§ 2.1). The platform also includes a complete heterogeneous compiler based on LLVM, which allows single-source single-binary development of heterogeneous applications

1. RTL = register-transfer level. We use the industry-standard SystemVerilog hardware description language.

with OpenMP 4.5 offloading (§ 2.2). The runtime libraries on the accelerator and driver on the host enable this offloading with little overhead (§ 2.3). A unified heterogeneous application programming interface (API) enables productive programming while providing fine-grained control where necessary (§ 2.4). The complete software stack and tools are open source under a permissive license.

- 2) We demonstrate the capabilities of HEROv2 by using it to study four current research topics in heterogeneous computing and provide quantitative insights on the level of applications (§ 3.1), toolchains (§ 3.2), system architecture (§ 3.3), and accelerator architecture (§ 3.4).
- 3) We also leverage HEROv2 to design and evaluate a novel solution to one of the most pressing problems in heterogeneous computing: how to relieve the programmer of the burden of specializing an algorithm to the memory hierarchy of the accelerator (§ 3.2).

Section 2 describes HEROv2's hardware and software platform. Section 3 focuses on case studies. Furthermore, we compare with related work in § 4 and conclude in § 5.

## 2 PLATFORM

HEROv2 consists of a complete heterogeneous hardware architecture (§ 2.1) as well as an end-to-end software stack including a toolchain and compilers (§ 2.2), operating system and runtime libraries (§ 2.3), and an application programming interface (§ 2.4).

#### 2.1 Hardware Architecture

HEROv2's hardware architecture combines a general-purpose CPU running a full operating system (OS) with a domain-specific programmable multi- or many-core accelerator. Fig. 1 gives an overview of the components and their connections. As many hardware components as possible are implemented on an FPGA (also called *programmable logic*) to make them configurable, modifiable, and extendable. All hardware modules mapped on the FPGA are available in synthesizable register-transfer level (RTL) logic under a permissive opensource license, which makes them fully analyzable and freely extensible and reusable. The vast majority of hardware components is silicon-proven, meaning they have been and will be used in ASIC tapeouts in many modern silicon technologies.

The host CPU is a hard-macro 64-bit ARMv8 Cortex-A multi-core on Xilinx' Zynq UltraScale+ family FPGAs or a soft-macro 64-bit RISC-V core (CVA6 architecture [20]) on any UltraScale+ family FPGA. The general design principle of the host core is to maximize performance per area or power on mostly sequential workloads with a complex control flow. Each host core features a private L1 instruction cache, L1 data cache, and a memory management unit (MMU). All host cores are attached to a coherent interconnect and share an L2 data and instruction cache.

Host and accelerator share an off-chip main memory (DDR DRAM or high bandwidth memory (HBM)) through the system interconnect, which can be coherent to the caches of the host. The memory hierarchy of the accelerator consists of software-managed scratch-pad memorys (SPMs). To copy

data into and out of the SPMs, the accelerator features direct memory access (DMA) engines. To share the virtual address space of an application running on the host, each accelerator features a hybrid IOMMU (such as [21]). This IOMMU consists mainly of a translation lookaside buffer (TLB), which translates virtual user-space application addresses to physical memory addresses and supports a high degree of concurrency (e.g., tens of outstanding transactions from different masters). The TLB is managed by the accelerator itself, which handles TLB misses by walking the application page table managed by the host and filling the corresponding entries into the TLB. The IOMMU is called *hybrid* because it is managed in software, which allows the accelerator to efficiently share virtual address pointers with a minimum amount of hardware (e.g., for buffers).

The accelerator is composed of many minimal 32-bit RISC-V cores, which are organized into clusters of 4 to 16 cores for scalability. Different RISC-V core architectures are supported (see Table 1), and consequentially the specific ISA of the accelerator varies, but all accelerator cores support at least the RV32IMA ISA. The focus of the accelerator core architecture is to maximize the performance per area or power on computationheavy workloads with a simple control flow. For this reason, the cores feature a single-issue in-order pipeline with 1 to 4 stages. To accelerate floating-point workloads, each core can be extended with a floating-point unit (FPU), which is highly parametrizable: depending on the needs of the application, it can execute one double-precision (fp64) multiply-accumulate (MAC), one or two single-precision (fp32) MACs, two to four half-precision (fp16) MACs, or four to eight quarter-precision (fp8) MACs in one clock cycle [22].

To accelerate workloads that heavily rely on functions outside common integer or floating-point operations, the cores support custom bit-manipulation instructions, and the cluster can additionally be extended with fixed-function hardware processing engines (e.g., [23]). To maximize the utilization of the compute units, each accelerator core supports custom instructions to repeat a sequence of instructions multiple times without branches (so-called *hardware loops*) as well as custom instructions to implicitly increment the memory address on a load or store.

Within each accelerator cluster, the cores have single-cycle access to a multi-banked, tightly-coupled L1 data SPM. A default banking factor of two allows any core to access any bank in any cycle with a low probability of contention for most applications. The cores can additionally access memory outside the own cluster, including shared main memory, with a latency between a few (to other clusters) to hundreds of cycles (to main memory, depending on on-chip network and memory controller). A custom control and status register (CSR) allows each 32-bit core to load from and store to any 64-bit address [24, § 5]. This CSR extends the native 32-bit address by 32 upper bit and is set automatically by the compiler (see § 2.2.1).

The cores fetch their instructions from an L1 instruction cache, which is shared by all cores in one cluster. To reduce the pressure on the shared instruction cache during loops, each core additionally contains an L0 instruction cache holding up to eight compressed instructions.

Finally, each accelerator cluster features a DMA engine, which can address the full 64-bit memory space, supports



Figure 1. HEROv2's hardware architecture, which combines a general-purpose *host* processor (in the upper left corner) with a domain-specific programmable many-core *accelerator* (on the right side) so that data in main memory can be shared effectively (in the lower left corner).

unified virtual memory through the hybrid IOMMU [25], can transfer up to 1024 bit per clock cycle in and out of the cluster (full duplex), and can have tens of transactions, each consisting of tens of data beats, outstanding at any time. This DMA engine allows to transfer data in high-bandwidth bursts while the accelerator cores compute on data in local memory. If an application allows issuing sufficiently many or long bursts, the DMA engine allows tolerating a latency of hundreds of cycles between main memory and accelerator, which is crucial to support the ongoing trend of deeper and non-uniform memory hierarchies.

Multiple accelerator clusters are interconnected with two non-coherent networks: a wide one for high-bandwidth DMA transfers and a narrow one for low-latency accesses by cores [26]. A high-bandwidth on-chip SRAM controller connects the L2 SPM, which is shared by all clusters, to the accelerator interconnect. The hybrid IOMMU connects the accelerator to the host. The IOMMU is either directly attached to a non-coherent system interconnect or via a bridge (such as [27]) as I/O-coherent request node to a coherent system interconnect. To program and control the accelerator from the host, it is additionally connected as I/O-coherent slave node to the system interconnect.

## 2.2 Toolchain and Compilers

HEROv2's heterogeneous hardware requires toolchain support to enable the development of applications for the platform in an efficient and productive way. HEROv2 provides such a heterogeneous toolchain, based on LLVM 9², which provides efficient support for heterogeneous compilation based on OpenMP. This enables the seamless co-integration of compute-focused accelerator kernels and control-focused host code into a unified application, including target-specific compilation and optimizations. Additionally, the different data widths of the system (64-bit host and 32-bit accelerator) are supported by LLVM's address space implementation, which provides the compiler with the means to express pointers of varying width. An overview of HEROv2's toolchain is shown in Fig. 2.

The toolchain flow starts by compiling OpenMP-annotated heterogeneous source code, as shown at the left of the figure. OpenMP describes heterogeneity and parallelism through #pragmas and leaves the transformation to parallel code to the compiler. Annotating a piece of code with #pragma omp target

directs the toolchain to compile the code both for the host<sup>3</sup> and the accelerator. We refer to these regions as *target regions*. The respective host and device toolchains are thereby invoked by the *Clang driver*, as shown in Fig. 2, which transform the source code into an object file for each architecture. The Clang driver triggers the device linker for the device object file, creating a RISC-V executable and linkable format (ELF) file, whereafter the host linker is triggered. The host linker first links the host object files into an ELF file and then embeds the device ELF as an object inside the host ELF, creating a *FAT binary*. This allows the OpenMP runtime to load the device ELF into accelerator memory at runtime.

HEROv2 uses an off-the-shelf LLVM-based 64-bit toolchain for the host and a custom LLVM-based 32-bit RISC-V toolchain for the device. Both toolchains are marked in Fig. 2, where the components that include customizations are highlighted in yellow tones, and aim to provide interoperability between host and accelerators, ease of programming, and support for ISA extensions.

# 2.2.1 Interoperability between Host and Accelerators

Pointers in C/C++, as well as in the LLVM intermediate representation (IR), have a fixed width: the *data width* of the target processor. A 32-bit data width of an accelerator therefore implies that 64-bit pointers from the host will be truncated. To allow 64-bit host pointers to be correctly represented, an additional 64-bit *address space* is defined in HEROv2's accelerator compiler. We refer to the two address spaces as the 32-bit *native address space* and the 64-bit *host address space*. Address space support is a built-in LLVM feature, and has been previously used, e.g., to separate pointers to global and shared memory in CUDA. In such cases, however, pointers are annotated by the programmer, e.g., \_\_shared\_\_ in CUDA, and all pointers typically have the same width.

To address mixed-data-width compilation in HEROv2, the *Clang frontend* has been extended to generate LLVM IR with automatically assigned address spaces. We adopt the techniques of [24], where OpenMP offloading entry points are used to infer that pointers passed to a device kernel from the host are 64-bit wide. The use of such pointers are then tracked throughout the application, such that any pointer that *cannot* be guaranteed to never hold a 64-bit host address is *promoted* to the host address space. Any pointer that is guaranteed to

3. By OpenMP's specification, the runtime decides during execution time if a target region is executed on the host or the accelerator, but in the case of HEROv2, the latter is always the case.



Figure 2. Overview of HEROv2's heterogeneous toolchain and compilers.

only hold 32-bit pointers is kept in the native address space. Additional control is handed to the programmer through \_\_device pointer decorations, to enforce a pointer to belong to the native address space, if the compiler could not guarantee it to be correct. As part of machine code generation, any data types and operations that are not natively supported by the underlying hardware and/or application binary interface (ABI) must be *legalized*. As pointer semantics are dropped in LLVM backends (i.e., pointers are treated as integers), the backend is able to implicitly legalize arithmetic operations on 64-bit pointers. However, the backend does not support the legalization of wider-than-native load and store operations. HEROv2's RISC-V compiler has therefore been extended with a custom host pointer legalizer pass right before the optimized code is passed to the RISC-V backend for machine code generation. This pass identifies all load and store operations on addresses in the host address space and implements them using the address extension CSR.

# 2.2.2 Ease of Programming and Code Portability

An important aspect for code portability and ease of programming is the automatic optimization of code for the memory hierarchy of a computer. HEROv2's accelerators use software-managed SPMs, which are refilled using DMA engines. This means software must explicitly orchestrate any data movements between shared main memory and fast local memory. As OpenMP does not provide any mechanisms to tile data structures and move tiles with DMA transfers, programmers need to manually rewrite their code to perform well on SPM-based accelerators. HEROv2's DMA API is unified over all accelerators, but the initial tiling of an application is nonetheless a significant effort and reduces code portability outside HEROv2.

To reduce this effort and improve code portability, HEROv2's device compiler provides an optional *AutoDMA* plugin that automatically analyzes source code to identify memory regions that are suitable for staging through SPMs and transforms the code to automatically program the DMA engine without any programmer intervention. The AutoDMA plugin is also able to perform loop tiling to extract segments of code whose memory footprint is small enough to fit in the local memory. The AutoDMA plugin is an extension of *HePREM* [28], originally envisioned for transforming real-time GPU code to be less sensitive to memory interference. This was achieved by transforming GPU kernels into a series of *load*, *execute*, and *store* phases, with explicit synchronization points between them. These three phases are well aligned with

accelerators based on software-managed SPMs. In contrast to *HePREM*, which targets DMA-less GPU systems, *AutoDMA* generates DMA API calls instead of moving data using load and store instructions. Additionally, synchronization has been minimized to improve performance. The resulting *AutoDMA* plugin provides an optional way to achieve performance on HEROv2 without the need for manual tiling and DMA management code.

## 2.2.3 Support for ISA Extensions

The device compiler backend of HEROv2 has been extended to support the ISA extensions supported by the RV32 cores of the accelerator. This includes the automatic detection and insertion of *hardware loop* instructions, automatic optimization to generate *post-increment* load and store instructions, as well as pattern matching to emit *multiply-accumulate* instructions, outlined in § 2.1. To the best of our knowledge, this is the first time custom instructions have been implemented for RISC-V in LLVM, and a full-system performance evaluation is shown in the case study in § 3.4.

In summary, HEROv2's heterogeneous toolchain provides a de-facto standard and heterogeneous-by-design programming model via OpenMP, which is fully support by its LLVM-based compiler. This provides seamless, end-to-end single-source-to-heterogeneous-binary compilation. The device compiler has been significantly extended to support performance, ease of programming, and code portability: first, through the minimization of expensive wider-than-native load and store operations in a mixed data-model setting; second, through the support for automatic tiling and DMA management through *AutoDMA*; and third, through automatic code generation targeting the performance-oriented ISA extensions supported by the underlying hardware.

## 2.3 Runtime Libraries and Operating System Support

HEROv2's runtime software stack is designed to seamlessly integrate the accelerators into the OS running on the host and allow for transparent accelerator programming with OpenMP 4.5 offloading [29] and unified virtual memory compliant with HSA specifications [30]. An overview of the runtime stack is shown in Fig. 3. This section discusses the layers below the API, which is discussed separately in § 2.4.

A heterogeneous application starts executing on the host. When the host encounters a #pragma omp target directive, it offloads the code within the target region to the specified (or default) accelerator (called *device* in OpenMP terminology).



Figure 3. HEROv2's runtime stack, which seamlessly integrates *accelerators* (with their runtime stack on the right) into the OS running on the *host* (runtime stack on the left) to enable heterogeneous applications with transparent offloading of accelerated kernels (at the top).

To this end, the host OpenMP runtime library invokes the device-specific runtime plugin. The plugin passes a pointer to the offloaded code and data to a hardware mailbox in the device, thereby starting execution on the device. The first core of the first cluster of the device runs an offload manager. It is woken by an interrupt from the hardware mailbox and starts executing the offloaded function. All data items inside the map clause become available to the device: When unified virtual memory is enabled, pointers are passed unmodified, no data is copied, and the device is given read-only access to the userspace page table of the application on the host. Otherwise, the host copies data to a physically contiguous memory region in main memory and changes the pointers before passing them to the device. By design, offloading does not copy data to the SPMs of the device. There are two main reasons for this: First, HEROv2's accelerator model aims at accelerating kernels that take at least a few ten thousand cycles to execute. Thus, the offloading model is relatively coarse grained and the mapped data in its entirety in general does not fit into the local memory of the device. Second, OpenMP's map clauses cannot express tiling, yet flexible tiling is essential for efficient execution on device-local memory.

Inside the offloaded region, execution starts on the first core of the first cluster. When that core encounters a #pragma omp teams directive, it forks execution to multiple clusters, and the cluster master core (i.e., the first core of each cluster) starts executing the region. When a cluster master core encounters a #pragma omp parallel directive, it forks execution to multiple cores of its cluster. Inside parallel regions, all OpenMP worksharing, datasharing, and synchronization constructs are available, allowing for effective parallel programming following OpenMP's standard paradigm. The OpenMP device runtime library implements the \_\_kmpc\_\* functions emitted by the compiler by calling into the accelerator-specific hardware abstraction library (HAL).

The virtual memory management (VMM) library allows the accelerator to share the virtual address space of a user-space application running on the host (concept of [21]). After the host has set up entries in the IOMMU that allow the accelerator to access the page table, the VMM library provides functions to translate any valid virtual address to a physical

address and set up a corresponding translation entry in the hybrid IOMMU. Application programmers usually do not notice this: The compiler generates the correct instructions for accessing pointers outside the native (32-bit physical) address space of the accelerator. In the common case, such accesses hit in the TLB of the IOMMU and incur an overhead of only three cycles per remote memory access [24]. When an access misses in the TLB, the core either invokes the VMM library itself to add an entry to the IOMMU, or it lets a dedicated core handle the misses. The latter is preferable for pointer-based applications, and miss handling can be configured per offload through custom options to the target region. The implementation of the VMM library is specific to the virtual memory system of the host (e.g., ARM VMSAv8-64 or RISC-V Sv39 or Sv48).

The HAL on the accelerator provides functions for forking parallel execution, identifying and synchronizing cores, putting cores to sleep and waking them up, controlling the DMA engine, and communicating between clusters and with the host through the mailbox. The HAL is implemented using low-level hardware-specific primitives, such as writing memory-mapped registers and setting bits in CSRs.

The OS device driver and the accompanying user-space accelerator library on the host implement the accelerator-specific functionality for offloading to and communicating with the accelerator from the host. This includes identifying the accelerator in the device tree, resetting, initializing, and programming it, and making the page table of the user-space process readable for the accelerator.

In summary, HEROv2's modular runtime stack supports different hosts and accelerators while reusing large parts of the code base, combining flexibility with accelerator-specific specialization. On the accelerator, all runtime libraries are linked into the offloaded application, and link-time optimization (LTO) minimizes the overhead of the multiple layers. On the host, system calls are required to trigger and conclude an offload, but the overhead of that is negligible due to HEROv2's coarse-grained offloading model.

# 2.4 Application Programming Interface

The application level is the most important from the perspective of end users and application developers. HEROv2's toolchain and runtime software provide the means to make effective use of accelerators, but without a properly designed API applications on heterogeneous computers remain too complex to program in most cases. Porting an application to make efficient use of the software-managed memory of an accelerator involves tiling data and scheduling data transfers, which are difficult tasks in general. An API alone cannot solve this problem, but it can make the work of the application programmer portable over different accelerators and substantially easier by abstracting the intricacies of the hardware away. The design goal is to provide an interface that is unified over all supported accelerators together with an implementation that is optimized and verified for each accelerator individually. HEROv2's API complements the OpenMP API for offloading and parallel programming (§ 2.3) and the accelerator-specific compiler (§ 2.2), which optimizes the compute part of an application for the target accelerator.

HEROv2's API has three main categories of functionality: memory management for the different SPM levels, data

transfers between SPMs and main memory, and performance measurements. All functions are thread-safe and can thus be used inside and outside parallel regions.

To manage the heap memory of the accelerator, there are three functions for each SPM level:  $hero_1N_capacity$  returns the currently available heap memory at SPM level N. This function is often used at the beginning of a tiling region to calculate the tile sizes.  $hero_1N_malloc$  and  $hero_1N_free$  implement POSIX' memory allocation and freeing functions [31] for SPM level N. The implementation uses a deterministic constant-complexity memory allocator [32], [33], ensures mutual exclusivity among all affected cores (e.g., within the same cluster for L1 SPM) through RISC-V atomic operations, and can detect heap overflows with a canary mechanism. The alignment and minimum allocation granule is  $8\,\mathrm{B}$ .

To transfer data between SPMs and main memory, HEROv2 provides multiple functions with the semantics of POSIX' memcpy [31]. Those functions are organized in three dimensions: direction (device-to-host or host-to-device), synchronicity (blocking or asynchronous), and transfer dimensionality (1D, 2D, etc.). The direction has to be distinguished in the function signature because pointers and addresses in the host-managed main memory are of a different width and address space than device-internal pointers: in hero\_memcpy\_host2dev\_\* functions, the src pointer is in the host address space and the dst pointer in the device address space, and vice-versa for the hero\_memcpy\_dev2host\_\* functions.

The synchronicity distinguishes functions that return as soon as the DMA engine has been programmed (with \_async suffix) or after all data has been transferred (without suffix). The asynchronous functions allow to start a DMA transfer and then work on different data while the DMA engine completes the transfer. Those functions return a unique transfer identifier, which has to be passed to the hero\_memcpy\_wait function to guarantee transfer completion before the data can be used. Multi-dimensional transfers allow to scatter and gather non-contiguous data with a single function call. For instance, the hero\_memcpy2d\_\* functions copy N sequences of B bytes from src to dst and apply a different address offset to src and dst after each sequence. This scatter-gather functionality is essential for tiling (e.g., to gather the rows of a tile of a 2D matrix from main memory into a dense SPM buffer before computation and scatter them back after computation). Whenever the DMA engine supports multi-dimensional transfers, they are executed directly by the DMA hardware; otherwise, they are implemented in software.

To measure the performance of applications and their execution on hardware, HEROv2 provides functions that provide a uniform interface to different hardware performance monitors and counters. The functions are mainly designed for hardware counters to which an event is assigned dynamically, which is common in modern processors. The available events range from monotonic clock cycles over memory accesses and stalls to memory and interconnect contention and utilization metrics. The hero\_perf\_alloc function allocates a counter for a given event and resets that counter. If the event is not supported by the hardware or the hardware counters are exhausted, the function returns an error. At the start of a program section to be investigated, a call to hero\_perf\_continue\_all starts all allocated counters, and

at the end of that section, hero\_perf\_pause\_all stops them. Those two functions execute with the minimal latency and overhead supported by the hardware (often as a single inlined CSR write instruction), allowing for precise, finegrained, and minimally intrusive performance measurements, which are crucial for identifying bottlenecks and systematic optimization.

## 3 EVALUATION

| Configuration    | Aurora           | Blizzard       | Cyclone        |  |  |
|------------------|------------------|----------------|----------------|--|--|
| Host ISA         | ARMv8.0-         | RV64GC         |                |  |  |
| Host Core Arch.  | Cortex-A5        | CVA6 [20]      |                |  |  |
| Host # Cores     | 4                | 1              |                |  |  |
| Accel. ISA       | RV32IMAFCXpulpv2 | RV32IMAF       | XssrXfrepXsdma |  |  |
| Accel. Core Arch | CV32E40P [34]    | Sr             | Snitch [35]    |  |  |
| Accel. # Cores   | 8                | 32             |                |  |  |
| Main Mem. Cap.   | 4 GiB DDR        | 8 GiB HBM2E    |                |  |  |
| Main Mem. BW     | up to 19.2 G     | up to 460 GB/s |                |  |  |
| Carrier Silicon  | Xilinx ZU9I      | Xilinx VU37P   |                |  |  |
| Carrier Freq.    | 50 MHz           | 25 MHz         |                |  |  |
| Status           | mature           | in development |                |  |  |

Table 1. Current target platforms and configurations of HEROv2.

In this section, we evaluate the most mature configuration of HEROv2: As host, it features an industry-standard quad-core 64-bit ARMv8 Cortex-A53 processor with 32 KiB L1 instruction and 32 KiB L1 data cache per core and an 1 MiB L2 cache shared by all four cores, implemented as hard macro and clocked at 1.2 GHz. As programmable many-core accelerator (PMCA), it features an octa-core 32-bit RISC-V floatingpoint accelerator (OpenHW CV32E40P core architecture) with 128 KiB L1 SPM and support for custom instructions (RV32IMAFCXpulpv2), implemented as soft-macro in the programmable logic (PL). Host and PMCA are connected through a lightweight IOMMU, which allows the PMCA to share the host's virtual memory space and which is implemented as soft-macro in PL, to a shared DRAM controller. The shared main memory consists of 4 GiB DDR4 DRAM, which provides up to 19.2 GB/s of bandwidth.

The implementation of PMCA and IOMMU on the PL of a Xilinx Zynq UltraScale+ ZU9EG system-on-chip (SoC) achieves a clock frequency of 50 MHz (without any FPGAspecific optimizations). The frequency is mainly limited by paths from the *request* output of the load/store unit (LSU) of an accelerator core through the cluster interconnect to the arbitrator of a memory bank and back to grant input of the LSU of another core. Among the available PL resources, the configurable logic blocks (CLBs) are the limiting factor with 98.1% utilization, of which 87.7% are used by the PMCA and 10.4% by the IOMMU. Within the PMCA, the cores (each of which includes an FPU), dominate with 38.4% of the total CLBs. 24.2 % of the block RAM tiles and 2.9 % of the DSP slices are used. We used Xilinx Vivado 2019.2 with the Alternate Routability synthesis strategy and the Congestion-Spread Logic-Low implementation strategy.

Variants of HEROv2 with alternative host processors and PMCAs are in development, and an overview of current configurations of HEROv2 and their status is shown in Table 1. The *Blizzard* configuration shares the host and the carrier silicon with the *Aurora* configuration evaluated here but features an octa-core RISC-V machine learning training (MLT) accelerator (RV32IMAFDXssrXfrepXsdma) with variable

precision support for 8 to 64 bit floating-point numbers. The *Cyclone* configuration targets a larger carrier silicon, on which a multi-cluster configuration of the MLT accelerator fits together with a 64-bit RISC-V host CPU. This configuration will not only offer higher accelerator performance but also an open-source soft-macro host CPU, which contrasts with the "black box" hard-macro Cortex-A53 host CPU of *Aurora* and *Blizzard*.

| Kernel  | Accelerated computation                                                                                           | Complexity $\mathcal{O}()$ |         |  |
|---------|-------------------------------------------------------------------------------------------------------------------|----------------------------|---------|--|
|         |                                                                                                                   | space                      | comput. |  |
| 2mm     | $C_{i,j} = \sum_{k=1}^{N} \alpha A_{i,k} B_{k,j}$                                                                 | $N^2$                      | $N^3$   |  |
| 3mm     | $\begin{split} E &= 2\mathrm{mm}(A,B) \to F = 2\mathrm{mm}(C,D) \\ \to G &= 2\mathrm{mm}(E,F) \end{split}$        | $N^2$                      | $N^3$   |  |
| atax    | $B_i = \sum_{j=1}^{N} A_{i,j} X_j$<br>$\rightarrow Y_i = \sum_{j=1}^{N} A_{j,i} B_j$                              | $N^2$                      | $N^2$   |  |
| bicg    | $Q_i = \sum_{j=1}^{N} A_{i,j} P_j$<br>$\rightarrow S_j = \sum_{i=1}^{N} R_i A_{i,j}$                              | $N^2$                      | $N^2$   |  |
| conv2d  | $B_{i,j} = \sum_{(k,l)=(-1,-1)}^{(1,1)} c_{k,l} A_{i+k,j+l}$                                                      | $N^2$                      | $N^2$   |  |
| covar   | $E_{j} = \alpha \sum_{i=1}^{M} D_{i,j}; D_{i,j} = E_{j};$<br>$S_{i,j} = S_{j,i} = \sum_{k=1}^{M} D_{k,i} D_{k,j}$ | $N^2$                      | $N^3$   |  |
| darknet | $C_{i,j} = \sum_{k=1}^{N} \alpha A_{i,k} B_{k,j}$                                                                 | $N^2$                      | $N^3$   |  |
| gemm    | $C_{i,j} = \beta \left( \sum_{k=1}^{N} \alpha A_{i,k} B_{k,j} \right)$                                            | $N^2$                      | $N^3$   |  |

Table 2. Evaluated kernels and applications. Subscripts denote indices, uppercase letters are variables, and lowercase letters are constants. Arrows  $(\rightarrow)$  denote consecutive offloads. Semicolons (;) denote consecutive computations within the same offload.

The evaluated applications and kernels, listed in Table 2, represent a wide range of accelerator workloads. From the Polybench/ACC benchmark suite [36], 2mm, 3mm, atax, bicg, and gemm are linear algebra kernels, conv2d is part of the "stencil" domain, and covar is part of the "datamining" domain. Together, these commonly accelerated kernels span a wide range of memory acess patterns and operational intensities. Additionally, darknet is an end-to-end real-time object detection application that implements the YOLO convolutional neural network (CNN) [37]. The data for all applications resides in host-managed shared DRAM. 3mm, atax, bicg, and darknet (one layer at a time) are composed of consecutive offloads, denoted by arrows  $(\rightarrow)$  in the table; all other kernels consist of a single offload. All benchmarks are compiled with -03 but no specific optimization flags. We take the time stamps of each accelerated application on the host, and it thus includes all data transfers and synchronization between host and accelerator. In all case studies, the accuracy of all results is fully maintained and verified. In all experiments, the host CPU runs Linux 4.19.0 on a root file system generated with Buildroot 2019.02.1, and we compile applications with LLVM 9.0.0 (extended as described in § 2.2).

#### 3.1 Application-Level Case Study

We begin with a case study on the application level. For each of the applications introduced above, we want to answer the following questions: How should the local memory of the accelerator be partitioned and data transfers organized so that the run time is dominated by computations on local memory? What is the speed-up compared to letting the accelerator load and store data directly from off-chip main memory? How should the application be parallelized over the cores in the accelerator, and what is the speed-up from parallelization?

The first two questions hold the key for making effective use of any accelerator with software-managed local memory.



Figure 4. Speed-up of execution on local memory with handwritten DMA transfers compared to execution on external main memory. Single accelerator thread.

To answer them, we divide input and output data into tiles. Assuming all data have the same dimensionality D, the side length of one tile is given by  $S = \lfloor (L/N)^{1/D} \rfloor$ , where N is the number of data elements (such as different vectors or matrices) and L is the capacity of the L1 for user data in number of words. With the evaluated accelerator architecture and runtime,  $L=28\,\mathrm{Ki}$  single-precision (i.e.,  $4\,\mathrm{B}$ ) words can be stored in L1. Tiling an algorithm is a non-trivial problem to which there is no general solution. We describe the tiling of one algorithm in the following to give an intuition, and we make the source code of all benchmarks available for full transparency and reproducibility (see link in conclusion).

For the convolutional layers in darknet, which are implemented as matrix-matrix multiplications, the tile side length of the two input matrices A and B and the output matrix C is S = 97. We loop over the tiles of A and transfer the current tile to L1. Within that loop, we loop over the tiles of B corresponding to the current horizontal dimension of A and transfer the corresponding tile of B and C in, perform the tiled matrix-matrix multiplication, and transfer the resulting tile of C out. The other arithmetic kernels are implemented in an analogous manner. As the left-hand scale in Fig. 4 shows, this reduces the run time compared to loading and storing directly from off-chip main memory by  $5.3 \times$  for darknet specifically and by  $4.3\times$  on average<sup>4</sup>. While this scheme does not exploit double buffering and the nonblocking DMA transfers that the platform is capable of, the share of cycles spent on DMA transfers is negligible (max: 1.9%, average: 0.2%), as the right-hand scale of Fig. 4 shows.

Every application lends itself differently to tiling and DMA transfers: In applications with high spatial locality, in particular when computation accesses data in the same sequence as it is stored in memory and it does so for large consecutive arrays, the DMA engine can transfer long continuous data bursts. This is particularly common in linear algebra and CNN kernels: the kernels with the highest speedup in Fig. 4 are all from those domains.

In applications with low spatial locality or divergent access patterns, DMA transfers are substantially shorter and thus offer lower speed-up. Nonetheless, the DMA engine's capability for gather-scatter transfers and many outstanding requests offers a speed-up of more than  $4\times$  even with low spatial locality. Temporal locality, on the other hand, has an even bigger impact: For some applications, tiling necessitates that each data element is loaded multiple times because local memory is not large enough to hold all data elements between two use instants. covar is an example of such an application,

4. Whenever we discuss the *average* of normalized numbers, we mean the *geometric mean* (denoted *geomean* in the figures), as reasoned in [38].

where each element of the data matrix has to be loaded twice (once during mean calculation and once while computing the covariance matrix). This reload factor of two reduces the speed-up by DMA transfers by almost  $2\times$  to only  $2.2\times$ .



Figure 5. Speed-up of execution with 8 accelerator threads compared to execution with 1 accelerator thread. Execution on local memory with handwritten DMA transfers.

The third question - how an application should be parallelized - holds the key for making effective use of any parallel accelerator. HEROv2's OpenMP runtime library enables to answer this question efficiently be experimentation: For the computation on one tile, we simply annotate the outermost computational loop with #pragma omp for to distribute its execution over the cores of the accelerator. As the left bar for each application in Fig. 5 shows, this reduces the computation cycles by 6.5 to  $7.1\times$  (average:  $6.9\times$ ) on an 8-core cluster. Even higher speed-ups by parallelization could be achieved by optimizing the loop schedule and stride, but we set that aside because it is sensitive to data size. The overall application speed-up by parallelization, shown by the middle bar for each application, is between 5.9 to  $7.1\times$ (average:  $6.7\times$ ). The right bar shows why the computationonly speed-up cannot be achieved for the overall application: The DMA transfers are not sped up by parallelization, so their share on the total cycles increases by the overall speed-up factor. Due to Amdahl's law, this limits the overall speed-up achievable by parallelization. On average, 2.2 % cycles spent on DMA transfers result in a modest decrease from  $7.0 \times$  to  $6.6\times$ . However, for some applications, such as covar, 10.3%cycles spent on DMA transfers reduce the parallelization speed-up from  $7.4 \times$  to  $6.6 \times$ . This may justify a more complex double-buffered implementation of an application.

This benchmark analysis shows how HEROv2's full-stack hardware and software allows to rapidly explore and optimize the accelerated performance of domain-relevant applications on a heterogeneous computer prototype: The high emulation throughput allows to study realistic problem sizes, and the complete software stack allows to adapt and tune real-world applications and representative kernels with reasonable effort and make informed optimization decisions. Furthermore, the fully open hardware implementation allows tracing and profiling hardware, as well as optimizing it.

#### 3.2 Runtime-Level and Toolchain Case Study

Tiling an algorithm for efficient execution on accelerator-local memory is not only an intellectual effort but also requires extra code to be written, verified, and maintained. HEROv2's API is designed to simplify this task for device-specific operations such as DMA transfers, which can be executed with a single function call, and fork-join parallelism, which is available through the standardized OpenMP pragmas that call into the runtime library. However, the part of tiling that is specific to

each algorithm cannot be substantially simplified by a runtime library.



Figure 6. Code complexity with handwritten tiling and DMA transfers compared to the unmodified code of each application. The light red bars show McCabe's cyclomatic complexity and the dark red bars show lines of code without comments.

The code complexity increase by handwritten tiling and DMA transfers compared to unmodified code is shown in Fig. 6 for each application. We used the CCCC tool<sup>5</sup> [39] on the accelerated part of each application and extracted two of its results: (1) The lines of code (without comments), which can be an indication for the effort of writing and reading a piece of code. (2) McCabe's cyclomatic complexity, which counts the number of linearly independent paths through a piece of code, and which can be an indication for the effort of understanding and verifying a piece of code. The results show three coarse categories of applications: First, the six applications on the left are tiled in a single dimension, which is a modest effort: the lines of code increase by 1.7 to  $2.5\times$ and the cyclomatic complexity increases by 1.3 to  $1.5\times$ . On average, the lines of code overhead by 1D tiling is 2× and the cyclomatic complexity increase is 1.4×. Second, darknet with its CNN layers is implemented with two-dimensional tiling and DMA transfers. 2D tiling substantially increases both the cyclomatic complexity  $(3.7\times)$  and the lines of code  $(3.4\times)$ . Third, covar is also implemented with 2D tiling, but the implementation is additionally split over two separate iterations through the entire data. This means the ca.  $3 \times$  lines of code overhead by 2D tiling incurs twice, leading to a total 6.3× lines of code overhead, while the cyclomatic complexity increases by the same factor as for darknet. In summary, the additional effort and maintenance cost for tiling an algorithm ranges from modest (1.7 $\times$  LOC, 1.5 $\times$  cyclo. compl.) to very high (6.3 $\times$  LOC, 4.0 $\times$  cyclo. compl.) and is certainly not negligible on average (2.6× LOC, 1.8× cyclo. compl.).

OpenMP assumes a cache-based memory hierarchy, leading to low performance on SPM-based memory hierarchies if a program is not manually tiled. To save these substantial manual tiling efforts, an optimal solution would be if the toolchain could automatically transform the untiled algorithm code to manage the memory hierarchy. The *AutoDMA* feature, introduced in § 2.2, brings this to HEROv2. Effectively, this means that the software-managed memory hierarchy of HEROv2 can be programmed as easily as a cache-based system.

The speed-up of compiler-generated and handwritten tiled code over unmodified OpenMP code is shown in Fig. 7. While the handwritten tiled code has a significantly higher complexity than the unmodified OpenMP code, as



Figure 7. Speed-up of execution on local memory with *compiler-generated* tiling and DMA transfers compared to execution on external main memory. 8 accelerator threads. The light blue bars show the speed-up of the handwritten implementation for comparison; the dark blue bars show the speed-up achieved by the compiler.

shown in Fig. 6, compiler-generated tiling requires zero code changes. The benchmarks in Fig. 7 can be divided into two categories: For covar and atax, the speed-up achieved by the compiler is marginal. For all other benchmarks, the speed-up achieved by the compiler is comparable to that of handwritten code. The benchmarks in the latter category feature large segments of contiguous memory accesses (spatial locality), and achieve on average 85% of the speed-up of handwritten code. The remaining 15% come from leveraging programmer insights (i.e., information not expressed in the code) to reduce the number of reconfigurations of the DMA engine: The handwritten code transfers multiple rows of matrices at once, possible by the understanding that the first element of row N + 1 is next in memory to the last element of row N. The compiler was not able to reconstruct this information, due to array-to-pointer decay in which the dimensions of data structures are lost. Without this information, the compiler considers multiple rows as non-contiguous and initiates a new DMA burst for each row, which adds an overhead compared to the single DMA burst in the handwritten code. Nonetheless, AutoDMA provides a speed-up of up to 4.4× without any code changes. Expert programmers still have the option to turn this feature off and implement tiling manually to extract the last tens of percent of performance.

For two benchmarks (covar and atax), the compilergenerated code cannot compete with the handwritten code. This can also be attributed to memory access patterns: a significant part of memory accesses are performed columnwise, i.e., in non-contiguous blocks. This effect is aggravated by the tile shape selected by the compiler, which inadvertently maximizes the number of column-wise accesses per tile, rather than contiguous row-wise accesses. This is due to the loop ordering of the benchmarks, which the AutoDMA feature does not rewrite<sup>6</sup>. Spatial locality is also important for performance on cache-based systems, but the issue is aggravated on HEROv2 where the DMA engine in this case is used to transfer individual words. As such, it is an extreme case of the overhead discussed for the previous category, where the compiler could not find sufficiently large chunks of contiguous memory. Despite these problems, the performance with AutoDMA is on-par to up to 20% higher than the OpenMP baseline, due to the high bandwidth of the DMA engine.

6. Tools that reorder loops, such as polyhedral analyses and transformations [40], could be used to preprocess the code, or the benchmarks could be manually rewritten using classical spatial locality optimizations.

In summary, for programs with high spatial locality, HEROv2's AutoDMA feature provides performance comparable to handwritten code, without the need for explicit tiling and DMA transfers. This reduces the execution time of unmodified OpenMP programs by up to  $4.4\times$  on software-managed memory hierarchies, achieving 85% of the speed-up of handwritten code. This makes software-managed memory hierarchies as easy to program as their hardware-cachebased counterparts. Similarly to hardware-managed caches, AutoDMA provides no significant improvements for programs with low spatial locality.

HEROv2 is a unique platform to analyze, develop, and optimize such compiler and runtime techniques, because it allows executing real applications and reference benchmarks on the actual RTL logic of a heterogeneous SoC, and because all its hardware and software components are open-source and permissively licensed.

## 3.3 System Architecture-Level Case Study

Our third case study examines the impact of an architectural design decision: How does the data width of the accelerator into the shared interconnect and main memory influence the performance of accelerated applications? To answer this question, we customize the on-chip network of the accelerator once to half the data width (32 bit) and once to twice the data width (128 bit) and remeasure our applications.

Fig. 8 shows the speed-up (for values > 1) or slow-down (for values < 1) for an accelerator on-chip network data width of 32 bit (left three bars of each application) and 128 bit (right three bars) compared to 64 bit. The leftmost bar in each group of three bars compares the cycles spent on DMA transfers: For most applications, halving the data width of the on-chip network results in a speed-up of 0.5×, and doubling the data width results in a speed-up of  $2\times$ , as expected. The exception, however, is darknet, with  $0.6 \times$  for half the data width and  $1.5 \times$  for double the data width. covar and darknet are the only application to use two-dimensional DMA transfers, which are composed of many relatively short bursts. This transfer pattern does not fully saturate the given on-chip network, which results in lower speed-ups for wider data widths. That is an important insight for optimizing the onchip network if DMA performance was critial for application performance. However, as we know from the application-level case study (§ 3.1), DMA transfers only account for at most 11.9% (average: 2.4%) of the application cycles. The majority of cycles is spent in computations, and the middle of each three bars compares the cycles spent on computations: Surprisingly, the data width of the on-chip network also has a significant impact on them. For 32 bit, the fetch bandwidth of instructions into the cache is halved, which leads to more instruction fetch stall cycles and reduces computational performance. For 128 bit, the fetch bandwidth for instructions could be doubled, but the instruction cache can only fetch at most 64 bit per cycle, so that has no impact. To accommodate the wider memory interface of the DMA engine, the tightly-coupled data memory (TCDM) interconnect in the accelerator cluster has to be changed from  $14 \times 16$  to  $18 \times 32$ . This configuration causes on average 15 % more contention on the TCDM despite the higher number of banks. A careful realignment of the cores on the TCDM interconnect could alleviate this, but the gist is that a wider accelerator on-chip network does not automatically



Figure 8. Speed-up of execution with an accelerator on-chip network data width of 32 bit or 128 bit compared to the default of 64 bit. Each application has six bars: The three bars on the left are for 64 bit data width, the three bars on the right are for 128 bit data width. Of each three bars, the left shows the speed-up of only the DMA cycles, the middle of only the computation cycles, and the right of the total cycles.

increase performance. In fact, as the rightmost bar of each application shows, application performance decreases by  $10\,\%$  on average if the design of the cluster is not simultaneously adapted.

Such insights from fully measured application executions are central for making substantiated decisions on the system architecture and for prioritizing engineering efforts. The closer the measured prototype is to the final design, the higher the quality of the measurements. Effects such as those discussed in this section would be extremely difficult to model with a simulator, as they depend on fine-grained interaction between several hardware components. Capturing this interaction quantitatively with non-cycle-accurate architectural simulation is a very intricate and error-prone task. Thus, an application-programmable heterogeneous research platform with a complete hardware and software stack, such as HEROv2, is a key enabler for architecture-level performance exploration.

#### 3.4 Accelerator ISA-Level Case Study

Specialized instructions are an important part of many domain-specific accelerators. They are often designed and evaluated in an instruction set simulation (ISS) or in RTL simulations. The drawback of ISS is that it is inaccurate as performance model because it does not capture microarchitectural effects. RTL simulation models the microarchitecture accurately, but it is only feasible for small data set and does not take communication outside the accelerator, which influences the memory subsystem and thereby the execution of the accelerated kernel, into account. Thus, a heterogeneous research platform is required to quantify the impact of specialized accelerator ISA extensions in heterogeneous computing with real-world data sets.

In this case study, we answer the question "How much do instructions from the Xpulpv2 ISA extension speed up execution of heterogeneous applications compared to the standard rv32imafc ISA?". As described in § 2.2, we have extended the RISC-V LLVM backend to automatically emit Xpulpv2 instructions during machine code generation. The evaluated kernels process data at full precision (i.e., 32-bit integers or floats) and therefore cannot make use of the quarter- or half-precision packed single instruction multiple data (SIMD) instructions, which would offer a significant speed-up for reduced-precision processing.

The speed-up of the Xpulpv2 ISA extension over the standard RISC-V RV32IMAFC ISA is shown in Fig. 9. We measure the total accelerator cycles with handwritten DMA transfers and 8 accelerator threads. As the first bar of each application shows, simply enabling Xpulpv2 provides a speed-up of  $1.5\times$  on average. Starting with gemm as an example, we



Figure 9. Speed-up of execution with custom instructions (Xpu1pv2) compared to the standard RISC-V (RV32IMAFC) ISA. Execution on local memory with handwritten DMA transfers and 8 accelerator threads. For each application, the first bar shows the speed-up by Xpu1pv2 instructions without manual register promotion, the second bar by Xpu1pv2 instructions with manual register promotion, and the third bar by implementing the innermost loop with inline assembly instructions (including manual register promotion and Xpu1pv2 instructions).

find that the compiler replaces the inner two compute loops by hardware loops. This is optimal, as there is only hardware for two loops. The body of the innermost loop is halved from 10 instructions (2 loads, 4 additions, 2 multiplications, 1 store, and 1 branch) to 5 instructions (2 post-increment loads, 1 multiplication, 1 MAC, and 1 store), while the bodies of the outer levels stay mostly identical. Apart from the store, which could be hoisted out of the innermost loop by a memory-to-register optimization pass, the innermost loop is optimal, and it is also optimally scheduled. The resulting speed-up of 2.5× can be attributed to halving the instructions within the innermost loop (ca. 2× speed-up) and hardware loops as well as less instructions in the outer loops (ca. 0.5× speed-up). Manually hoisting the store out of the innermost loop significantly improves performance further. Again looking at gemm, this reduces the innermost loop from 5 to 4 instructions, and the resulting relative speedup of  $1.28 \times$  is aligned with the reduction in instructions. The same findings hold for 3mm and 2mm, and a comparison with an inline assembly implementation of the innermost loop reveals that the instructions generated by the compiler perform on-par or better than the expert-written instructions. However, some benchmarks behave quite differently: For conv2d, atax, and bicg, the Xpulpv2 ISA extension provides only between 10 to 50% of speed-up – both with compilergenerated instructions and with an expert-written inner loop body. There are two main reasons for this: First, the kernels are not as well suited for post-increment memory accesses

7. For gemm, the multiplication by  $\alpha$  could be hoisted out of the innermost loop for all data types where multiplication is distributive over addition. However, this is an algebraic transformation and does not apply to all data types (such as floats), so we do not consider it.

as the matrix-matrix multiplication kernels. For atax, the increment of one of the two loads in the innermost loop is too large to be used in post-increment. For conv2d, the 2D  $(3 \times 3)$  loads in the innermost loop leave some opportunities for post-increment loads, but they are complex to exploit. Even the expert-written instructions, which use as many postincrement accesses as possible, do not bring a significant speedup. Second, hardware loops are not inferred for the innermost loops. This could be because the innermost loop iterates over the rows in a tile, and the number of rows changes depending on the tile index. This does not fundamentally preclude the use of hardware loops, however, so it is a current compiler limitation. Finally, covar sees a very high speed-up with Xpulpv2, but only with manual memory-to-register promotion. This simple change in the code enables the compiler to infer a hardware loop. The instructions generated by the compiler substantially outperform the expert-written inner loop, due to better scheduling.

In summary, the Xpulpv2 ISA extension has the potential to significantly accelerate all kernels we evaluated, mainly through post-increment memory accesses and hardware loops. Especially the latter are not trivial for the compiler to generate in all cases, however, which currently leads to speed-ups between 1.1 to  $3.5\times$  (average:  $2.1\times$ ). While the impact of changes to the accelerator ISA could also be studied in isolation (e.g., in RTL simulation), evaluating within a heterogeneous prototype system (such as HEROv2) produces more representative results, because the balance and interaction between memory transfers from and to the shared main memory and computation are taken into account, and because a prototype running at tens of MHz makes it feasible to work with real-scale data sets.

# 4 RELATED WORK

Emulation systems on FPGAs or custom programmable logic are widely used to get cycle-accurate results at a clock frequency of multiple MHz and turn-around times of few hours to days. In the FPGA Architecture Model Execution (FAME) taxonomy [55], HEROv2 is a *Direct FAME* system, which are characterized by implementing the target system with a one-to-one correspondence in clock cycles on an FPGA. Commercial Direct FAME systems include Cadence Palladium [41], Siemens Veloce Strato [42], and Synopsys ZeBu [43]. Those systems are capable of emulating up to 20 billion ASIC gate equivalents (GEs) at up to 10 MHz and can cost millions of USD. HEROv2 can scale over multiple FPGAs with chip-to-chip FPGA Mezzanine Card (FMC) connections, which are supported by all of HEROv2's carrier silicon. Depending on the design, the system interconnect, the accelerator interconnect, or both can extend over multiple FPGAs through FMC and QSFP+ connections. HEROv2's currently largest carrier silicon, Xilinx' VCU128, offers ca. 40 million ASIC GEs and can communicate with other FPGAs at more than 650 Gbit/s. Depending on FPGA and configuration, HEROv2's clock frequency is between 20 to 100 MHz.

FPGA-based computer system emulators are common in industry and research. The following recent works are comparable with ours (see Table 3 for an overview and [57] for a broader survey of older approaches up to 2014): Open-Piton [44] is an open-source many-core research framework

that can be implemented on an FPGA. It comes with a cachecoherent on-chip network and by now supports four different processor cores [18], [45], among them CVA6 also supported by HEROv2. The most recent version of OpenPiton optionally includes an open-source GPGPU or Nvidia's deep learning accelerator (NVDLA), which can be programmed from Linux running on the processor cores. This recent developments allow using OpenPiton for research on heterogeneous computing, which is HEROv2's focus, but the full hardware-software stack integration of accelerators, from API to acceleratorspecific compiler backend, remains HEROv2's distinguishing feature. MEG [46] is a system emulation infrastructure for neardata processing implemented on an FPGA. It features four 64bit RISC-V Boom cores as host processor and a near-memory accelerator whose architecture and ISA are not specified. Like HEROv2, MEG features a Linux-booting host processor and is implemented on a VU37P, but unlike HEROv2, the focus is on near-memory accelerators that seem to have a fixed function, as accelerator programming, memory hierarchy, data transfers, and communication with the host are not discussed. DART [47] accelerates the simulation of on-chip networks by mapping them onto an FPGA. It provides programmability by decoupling the simulator architecture from the architecture of the simulated on-chip network. Similarly, DuCNoC [49] maps on-chip networks to the PL of a Zyng-7000 SoC. Like HEROv2, the on-chip network is highly configurable and modeled cycle-accurately at 10 MHz and more in DART and DuCNoC, but unlike HEROv2, the remainder of the computer system remains in a higher-level simulator that injects traffic into the on-chip network. Prasad et. al. [48] improve on DART by specializing the microarchitecture of on-chip network components to the target FPGA architecture, which reduces the required hardware resources by 70% and the average packet latency by 20 %. In contrast, HEROv2's components are not specialized to FPGAs, which means they consume more hardware resources than minimally required but also that they match an ASIC implementation cycle-by-cycle.

FireSim [50], [51] extends FPGA-based emulation to Amazon EC2 F1, a public cloud FPGA platform. On the FPGA of each instance, FireSim allows instantiating modules from the Chipyard [58] ecosystem (e.g., the 64-bit RISC-V Rocket core, a L2 cache, a network interface controller (NIC), and fixedfunction accelerators such as the Hwacha vector processor). Multiple instances are connected over the datacenter network and C++ simulation models to emulate datacenter clusters with multiple server nodes. Like HEROv2, FireSim comes with an OS-capable multi-core CPU, but unlike HEROv2, the focus is on datacenter clusters and networking instead of heterogeneous computing with different ISAs, data models, execution models, and memory subsystems. Centrifuge [52] extends FireSim with a flow that generates heterogeneous SoCs containing user-defined HLS accelerators together with a Linux driver for them. In contrast, accelerators in HEROv2 can be interfaced with user-space libraries or in heterogeneous OpenMP applications, but the accelerator software is not auto-generated.

Research platforms that combine HW and SW components are less common. OpenESP [17], [53] is a research platform for heterogeneous SoC design. It provides a methodology and components to integrate processors (among them CVA6 also supported by HEROv2) and HLS-generated accelerators with

| Platform                     | FAME | Emul. | HW        | Open-    | Heterogeneous SoC Features |               |                |             |     |
|------------------------------|------|-------|-----------|----------|----------------------------|---------------|----------------|-------------|-----|
|                              |      | Focus | Ecosys.   | Source   | Hosts                      | Accelerators  | Program. Model | Accel. Coh. | MDM |
| Cadence Palladium [41]       | 0    | Dig   | ©         | X        |                            |               |                |             |     |
| Siemens Veloce Strato [42]   | 0    | Dig   | ©         | X        | customer-provided          |               |                |             |     |
| Synopsys ZeBu [43]           | 0    | Dig   | ©         | X        | customer-provided          |               |                |             |     |
| OpenPiton [18], [44], [45]   | 0    | Many  | p         |          | T1, CVA6, PRV32, ao486     | MIAOW, NVDLA  |                | 4           | X   |
| MEG [46]                     | 0    | Mem   | p         |          | Boom                       | custom        | n/d            | n/d         | X   |
| DART [47], [48], DuCNoC [49] | 3    | Net   | d         | <b>X</b> |                            | n/a           |                | n/d         | X   |
| FireSim [50], [51]           | 1, 5 | Dig   | <b>2</b>  |          | Rocket                     | NVDLA, HLS    |                | n/d         | X   |
| Centrifuge [52]              | 1, 5 | HeSoC | <b>20</b> |          | Rocket                     | HLS           |                | 2, 4        | X   |
| OpenESP [17], [53]           | 0    | HeSoC | p         |          | CVA6, LEON3                | custom, HLS   | dl d           | 1, 2, 4     | X   |
| HERO [19], [54]              | 0    | HeSoC | <>, tb    |          | A9                         | **            | ۰, ط, ط        | 1, 2        | X   |
| HEROv2 [this work]           | 0    | HeSoC | 🚸, 🕼      |          | A53, CVA6                  | <b>♦</b> ♦, 🔘 | o, do, d       | 1, 2, 3     |     |

Table 3. Comparison of computer emulation systems on programmable logic devices. Legend: FAME: taxonomy numbers [55] | Emul. Focus: Digital hardware, Heterogeneous SoCs, Manycores, Near-Memory Processing, On-Chip Networks | HW Ecosys.: © commercial, as ported or developed by platform maintainers, PULP, OpenHW Group, Chipyard | Hosts: ARM Cortex-A9 or -A53, OpenHW Group CVA6, OpenSPARC T1, Berkeley out-of-order machine (Boom), Gaisler LEON3, PicoRV32 | Accelerators: PULP cluster, Osnitch cluster, MIAOW GPGPU, custom logic, NVDLA, HLS-generated | Programming Model: accelerator separately programmable through an OS driver or a user-space API, or unified host+accelerator programming with heterogeneous OpenMP applications, or n/d not defined | Accelerator Coherence Modes [56]: 1 non-coherent DMA, 2 LLC-coherent DMA, 3 coherent DMA, 4 coherent cache, or not defined (n/d) | MDM: mixed-data-model (e.g., 64-bit host + 32-bit accelerator) programming supported.

a 2D-mesh on-chip network. Like in HEROv2, the accelerators have a DMA engine and can share virtual addresses with a processor through an IOMMU and a Linux driver. Unlike in HEROv2, accelerators are not programmable with a fullfeatured standard ISA, and there is thus no OpenMP offloading support and no heterogeneous API, runtime libraries, and toolchain that span across host processors and accelerators. HEROv1 [19], [54] does provide the components that enable the evaluation of heterogeneous applications on a mixed-ISA computer, but its toolchain is fundamentally limited to 32-bit hosts and accelerators [24]. Additionally, it has no API that unifies programming over multiple accelerators; it features one host and one accelerator architecture, and hardware and software are tailored to those instead of being modular; and its on-chip network is limited to simple configurations (e.g., fixed 64-bit data width) and topologies (e.g., central crossbar), which do not meet the demands of modern heterogeneous computers.

Accelerators have been designed specifically for FPGAs. GRVI Phalanx [15] 32-bit RISC-V soft processor array that scales to more than 1000 cores on a Xilinx VU9P FPGA. 2GRVI Phalanx [59] extends that to more than 1000 64-bit RISC-V cores on a Xilinx VU37P. The DRAGON architecture [60] is a 64-bit custom-ISA cluster-based multiprocessor that scales to 144 cores on a Xilinx VU37P. In contrast, the accelerator in HEROv2 is not specialized for FPGAs but has identical RTL code as for ASIC tapeouts. Its components, from cores [20], [34] over the accelerator cluster [61] to the on-chip communication fabric [26] have been taped out in multiple ASICs.

Programming models targeting heterogeneous computing are manifold, and we refer to [62] for an overview. In OpenCL [63], an application on the host submits separately-written kernels to be executed on an accelerator to a command queue. OpenCL is imperative, meaning application programmers have to explicitly call functions to create buffers, transfer data, and start execution on an accelerator. SYCL [64] extends OpenCL by enabling single-source heterogeneous programming and C++ AMP [65] by relieving the programmers from explicit data transfers between host and device. oneAPI Data Parallel C++ (DPC++) [66] builds on SYCL to define functions that can be offloaded to devices, and an open-source LLVM implementation is in development.

OpenMP, supported natively by HEROv2, is declarative, meaning application programmers describe *what* they want to do (e.g., offload a code section with data to an accelerator) while the compiler and runtime libraries take care of *how* those actions happen. OpenACC [67] goes even further: its directives describe the properties of a program (e.g., a parallel loop with independent iterations), and the toolchain and runtime libraries specialize the program to an accelerator. In Clang, OpenACC is implemented by translation to OpenMP. Through this, HEROv2 also supports OpenACC. HEROv2's open-source LLVM-based toolchain will enable the community to construct complementary and alternative heterogeneous computing software stacks, while building on a solid open infrastructure.

Heterogeneous compilers have also been implemented by others. Intel offers an OpenMP offloading compiler for its Xeon Phi accelerators [68], which differ from the host CPU by accelerator-specific extensions. Those extensions are only available through the proprietary Intel compiler, whereas HEROv2's full toolchain is open source. Research works on GCC [69] were the first to provide an open-source heterogeneous OpenMP toolchain, but GCC's offloading compilation is fundamentally limited to the same data model (e.g., 32-bit) for host and accelerators [24]. Mixed-data-model heterogeneous compilation has been pioneered recently [24] with Clang/LLVM, and HEROv2 integrates that work into its toolchain.

# 5 CONCLUSION

HEROv2 is a full-stack open-source<sup>8</sup> research platform for state-of-the-art heterogeneous computing: HEROv2 provides all hardware and software required to develop, compile, and run single-source, single-binary heterogeneous applications and seamlessly offload and share data from an application-class 64-bit host to a programmable 32-bit parallel accelerator. Thus, HEROv2 enables effective and accurate research from applications and algorithms down to microarchitecture. Additionally, HEROv2 comes with a novel *AutoDMA* compiler plugin, which provides a solution to one of the most pressing problems of accelerators with software-managed memories:

without any code changes, AutoDMA tiles loops and infers DMA transfers, which leads to a speed-up of up to  $4.4\times$  without any code changes and in most cases is only  $15\,\%$  slower than a handwritten implementation, which requires  $2.6\times$  more code.

HEROv2 enables research in various domains, and we know of ongoing projects that use HEROv2 in high-performance computing, real-time processing, in-network processing, transprecision accelerators, and parallel programming. We expect future work to evolve in the directions of larger scale-out accelerators, mixed and finer-grained coherency domains, and novel virtualization and communication technologies. We are also working on a tape-out in a modern silicon technology.

## **ACKNOWLEDGMENTS**

The authors thank Koen Wolters for his work on the host software stack and the heterogeneous toolchain and Maxim Mattheeuws for porting darknet to HEROv2. This work has been partially funded by the ECSEL Joint Undertaking for the FRACTAL project under grant agreement no. 877056 and the Croatian-Swiss Research Programme (CSRP) for the Heterogeneous Computing Systems with Customized Accelerators (HCSCA) project under project no. 180625.

#### REFERENCES

- M. Horowitz, "Computing's energy problem (and what we can do about it)," in *IEEE International Solid-State Circuits Conference (ISSCC)*, 2014, pp. 10–14.
- [2] M. Zahran, "Heterogeneous computing: Here to stay," *Commun. ACM*, vol. 60, no. 3, p. 42–45, Feb. 2017. [Online]. Available: https://doi.org/10.1145/3024918
- https://doi.org/10.1145/3024918
  [3] W. J. Dally *et al.*, "Domain-specific hardware accelerators," *Commun. ACM*, vol. 63, no. 7, p. 48–57, Jun. 2020. [Online]. Available: https://doi.org/10.1145/3361682
- https://doi.org/10.1145/3361682
  [4] M. Ditty et al., "Nvidia's Xavier SoC," in IEEE/ACM HotChips 30, 2018. [Online]. Available: https://old.hotchips.org/hc30/1conf/1. 12\_Nvidia\_XavierHotchips2018Final\_814.pdf
- [5] P. Bannon et al., "Compute and redundancy solution for the full self-driving computer," in IEEE/ACM HotChips 31, 2019. [Online]. Available: https://old.hotchips.org/hc31/HC31\_2.3\_ Tesla\_Hotchips\_ppt\_Final\_0817.pdf
- [6] S. Arora et al., "AMD 7nm Ryzen 4000 APU Renoir," in IEEE/ACM HotChips 32, 2020. [Online]. Available: https://hotchips.org/assets/program/conference/day1/ HotChips2020\_Mobile\_Processors\_AMD\_Renoir.pdf
- [7] J. L. Hennessy *et al.*, "A new golden age for computer architecture," *Commun. ACM*, vol. 62, no. 2, p. 48–60, Jan. 2019. [Online]. Available: https://doi.org/10.1145/3282307
- https://doi.org/10.1145/3282307
  [8] A. Reuther *et al.*, "Survey of machine learning accelerators," in *IEEE High Performance Extreme Computing Conference (HPEC)*, 2020, pp. 1–12
- [9] C.-Y. Gui et al., "A survey on graph processing accelerators: Challenges and opportunities," *Journal of Computer Science and Technology*, vol. 34, no. 2, pp. 339–371, 2019.
- [10] R. Ubal et al., "Multi2Sim: A simulation framework for CPU-GPU computing," in 21st IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT), 2012, pp. 335–344.
- [11] J. Power *et al.*, "gem5-gpu: A heterogeneous CPU-GPU simulator," *IEEE Computer Architecture Letters*, vol. 14, no. 1, pp. 34–36, 2015.
- [12] A. Butko et al., "Full-system simulation of big.LITTLE multicore architecture for performance and energy exploration," in 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC), 2016, pp. 201–208.
- [13] A. Akram *et al.*, "A survey of computer architecture simulation techniques and tools," *IEEE Access*, vol. 7, pp. 78120–78145, 2019.
- [14] Y. Lee et al., "An agile approach to building RISC-V microprocessors," IEEE Micro, vol. 36, no. 2, pp. 8–20, 2016.
- [15] J. Gray, "GRVI Phalanx: A massively parallel RISC-V FPGA accelerator accelerator," in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016, pp. 17–20.

- [16] A. Kamaleldin et al., "Towards a modular RISC-V based many-core architecture for FPGA accelerators," IEEE Access, vol. 8, pp. 148 812–148 826, 2020.
- [17] P. Mantovani *et al.*, "Agile SoC development with Open ESP," in 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2020, pp. 1–9.
- [18] J. Balkind *et al.*, "OpenPiton at 5: A nexus for open and agile hardware design," *IEEE Micro*, vol. 40, no. 4, pp. 22–31, 2020.
- [19] A. Kurth et al., "HERO: An open-source research platform for HW/SW exploration of heterogeneous manycore systems," in Proceedings of the 2nd Workshop on AutotuniNg and ADaptivity AppRoaches for Energy Efficient HPC Systems, ser. ANDARE '18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3295816.3295821
- [20] F. Zaruba et al., "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7-GHz 64-bit RISC-V core in 22-nm FDSOI technology," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 11, pp. 2629–2640, 2019.
- [21] P. Vogel et al., "Efficient virtual memory sharing via on-accelerator page table walking in heterogeneous embedded SoCs," ACM Trans. Embed. Comput. Syst., vol. 16, no. 5s, Sep. 2017. [Online]. Available: https://doi.org/10.1145/3126560
- [22] S. Mach et al., "FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 29, no. 4, pp. 774–787, 2021.
- [23] F. Conti et al., "XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
- [24] A. Kurth et al., "Mixed-data-model heterogeneous compilation and OpenMP offloading," in Proceedings of the 29th International Conference on Compiler Construction, ser. CC 2020. New York, NY, USA: Association for Computing Machinery, 2020, p. 119–131. [Online]. Available: https://doi.org/10.1145/3377555.3377891
- [25] —, "Scalable and efficient virtual memory sharing in heterogeneous SoCs with TLB prefetching and MMU-aware DMA engine," in 2018 IEEE 36th International Conference on Computer Design (ICCD), 2018, pp. 292–300.
- [26] —, "Ân open-source platform for high-performance non-coherent on-chip communication," IEEE Transactions on Computers, 2021.
- [27] M. Cavalcante et al., "Design of an open-source bridge between non-coherent burst-based and coherent cache-line-based memory systems," in Proceedings of the 17th ACM International Conference on Computing Frontiers, ser. CF '20. New York, NY, USA: Association for Computing Machinery, 2020, p. 81–88. [Online]. Available: https://doi.org/10.1145/3387902.3392631
- [28] B. Forsberg *et al.*, "Heprem: A predictable execution model for gpubased heterogeneous socs," *IEEE Transactions on Computers*, vol. 70, no. 1, pp. 17–29, 2020.
- [29] OpenMP Application Programming Interface, Version 4.5, OpenMP Architecture Review Board, 2015.
- [30] W.-M. W. Hwu, Ed., Heterogeneous System Architecture. Morgan Kaufmann Publishers, 2016.
- [31] POSIX.1-2017 (IEEE Std 1003.1-2017), IEEE and The Open Group, 2018.
- [32] J. Herter, "Timing-predictable memory allocation in hard real-time systems," Ph.D. dissertation, Universität des Saarlandes, 2014.
- [33] P. Kirienko, "Constant-complexity deterministic memory allocator (heap) for hard real-time high-integrity embedded systems," 2020. [Online]. Available: https://github.com/pavel-kirienko/o1heap
- [34] M. Gautschi et al., "Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
- [35] F. Zaruba et al., "Snitch: A tiny pseudo dual-issue processor for area and energy efficient execution of floating-point intensive workloads," IEEE Transactions on Computers, pp. 1–1, 2020.
- [36] S. Grauer-Gray et al., "Auto-tuning a high-level language targeted to GPU codes," in 2012 Innovative Parallel Computing (InPar). IEEE, 2012, pp. 1–10.
- [37] J. Redmon et al., "You Only Look Once: Unified, real-time object detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
- [38] P. J. Fleming *et al.*, "How not to lie with statistics: The correct way to summarize benchmark results," *Commun. ACM*, vol. 29, no. 3, p. 218–221, Mar. 1986. [Online]. Available: https://doi.org/10.1145/5666.5673

- [39] T. Littlefair, "An investigation into the use of software code metrics in the industrial software development environment," Ph.D. dissertation, Edith Cowan University, 2001.
- [40] T. Grosser et al., "Polly—performing polyhedral optimizations on a low-level intermediate representation," Parallel Processing Letters, vol. 22, no. 04, p. 1250010, 2012.
- [41] Palladium Z2 Enterprise Emulation Platform Enabling design teams to debug and accelerate verification of complex IPs to multi-billion-gate SoC designs, Cadence, 2021.
- [42] "Emulation a job management strategy to maximize use," Siemens, White Paper, 2021.
- [43] ZeBu EP1: Industry's Fastest Billion Gates Emulation System, Synopsys, 2021
- [44] J. Balkind et al., "OpenPiton: An open source manycore research framework," SIGPLAN Not., vol. 51, no. 4, p. 217–232, Mar. 2016. [Online]. Available: https://doi.org/10.1145/2954679.2872414
- [45] —, "BYOC: A "bring your own core" framework for heterogeneous-ISA research," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '20. New York, NY, USA: Association for Computing Machinery, 2020, p. 699–714. [Online]. Available: https://doi.org/10.1145/3373376.3378479
- [46] J. Zhang et al., "MEG: A RISCV-based system emulation infrastructure for near-data processing using FPGAs and high-bandwidth memory," ACM Trans. Reconfigurable Technol. Syst., vol. 13, no. 4, Sep. 2020. [Online]. Available: https://doi.org/10.1145/3409114
- [47] D. Wang et al., "DART: A programmable architecture for NoC simulation on FPGAs," IEEE Transactions on Computers, vol. 63, no. 3, pp. 664–678, 2014.
- [48] B. P. Prasad et al., "FPGA friendly NoC simulation acceleration framework employing the hard blocks," Computing, pp. 1–23, 2021.
- [49] H. M. Kamali et al., "DuCNoC: A high-throughput FPGA-based NoC simulator using dual-clock lightweight router micro-architecture," IEEE Transactions on Computers, vol. 67, no. 2, pp. 208–221, 2018.
- [50] S. Karandikar et al., "FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, pp. 29–42.
- [51] —, "FirePerf: FPGA-accelerated full-system hardware/software performance profiling and co-design," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '20. New York, NY, USA: Association for Computing Machinery, 2020, p. 715–731. [Online]. Available: https://doi.org/10.1145/3373376.3378455
- [52] Q. Huang et al., "Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-acceleration," in 2019 IEEE/ACM International Conference on Computer-Aided Design (IC-CAD), 2019, pp. 1–8.
- [53] D. Giri et al., "Accelerator integration for open-source SoC design," IEEE Micro, vol. 41, no. 4, pp. 8–14, 2021.
- [54] A. Kurth et al., "HERO: heterogeneous embedded research platform for exploring RISC-V manycore accelerators on FPGA," in Proceedings of the First International Workshop on Computer Architecture Research with RISC-V, ser. CARRV '17, 2017.
- [55] Z. Tan et al., "A case for FAME: FPGA architecture model execution," SIGARCH Comput. Archit. News, vol. 38, no. 3, p. 290–301, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1816038.1815999
- [56] J. Zuckerman et al., "Cohmeleon: Learning-based orchestration of accelerator coherence in heterogeneous SoCs," in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '21. New York, NY, USA: Association for Computing Machinery, 2021, p. 350–365. [Online]. Available: https://doi.org/10.1145/3466752.3480065
- [57] H. Angepat et al., "FPGA-accelerated simulation of computer systems," Synthesis Lectures on Computer Architecture, vol. 9, no. 2, pp. 1–80, 2014. [Online]. Available: https://doi.org/10.2200/ S00586ED1V01Y201407CAC029
- [58] A. Amid et al., "Chipyard: Integrated design, simulation, and implementation framework for custom SoCs," IEEE Micro, vol. 40, no. 4, pp. 10–21, 2020.
- [59] J. Gray, "2GRVI Phalanx: Towards kilocore RISC-V FPGA accelerators with HBM2 DRAM," in IEEE/ACM HotChips 30, 2019.
- [60] R. B. Abdelhamid et al., "A highly-efficient and tightly-connected many-core overlay architecture," IEEE Access, vol. 9, pp. 65277– 65292, 2021.

- [61] D. Rossi *et al.*, "Energy-efficient near-threshold parallel computing: The PULPv2 cluster," *IEEE Micro*, vol. 37, no. 5, pp. 20–31, 2017.
- [62] S. Mittal et al., "A survey of CPU-GPU heterogeneous computing techniques," ACM Comput. Surv., vol. 47, no. 4, Jul. 2015. [Online]. Available: https://doi.org/10.1145/2788396
- [63] The OpenCL C Specification v3.0.7, Khronos OpenCL Working Group, Apr. 2021.
- [64] SYCL 2020 Specification (revision 3), Khronos SYCL Working Group, Mar. 2021.
- [65] C++ Accelerated Massive Parallelism (C++ AMP), Microsoft Corp., Nov. 2018.
- [66] oneAPI Specification, Release 1.0-rev-3, Intel Corp., Nov. 2020.
- [67] The OpenACC Application Programming Interface v3.1, The OpenACC Organization, Nov. 2020.
- [68] "Migrating offloading software to Intel Xeon Phi processor," Intel Corp., White Paper, Feb. 2018.
- [69] A. Capotondi et al., "Runtime support for multiple offload-based programming models on clustered manycore accelerators," IEEE Transactions on Emerging Topics in Computing, vol. 6, no. 3, pp. 330–342, 2018.



Andreas Kurth received his BSc and MSc degree in electrical engineering and information technology from ETH Zurich in 2014 and 2017, respectively. He is currently pursuing a PhD degree in the Digital Circuits and Systems group of Prof. Benini. His research interests include the architecture and programming of heterogeneous SoCs and accelerator-rich computing systems.



**Björn Forsberg** received his PhD from ETH Zurich in 2021, working on compiler and software-centric techniques for real-time execution on embedded heterogeneous systems. His interests lie at new and enabling technology at the hardware/software boundary, and its impact on programmability, real-time, and performance. He received his MSc degree from Uppsala Universitet in 2015.



**Luca Benini** (F'07) holds the chair of Digital Circuits and Systems at ETH Zurich and is Full Professor at the Università di Bologna. Dr. Benini's research interests are in energy-efficient computing systems design, from embedded to high-performance. He has published more than 1000 peer-reviewed papers and five books. He is a Fellow of the ACM and a member of Academia Europaea.